RawHash: Enabling Fast and Accurate Real-Time Analysis of Raw Nanopore Signals for Large Genomes
Nanopore sequencers generate raw electrical signals in real time while
sequencing long genomic strands. These raw signals can be analyzed as they are
generated, providing an opportunity for real-time genome analysis. An important
feature of nanopore sequencing, Read Until, can eject strands from sequencers
without fully sequencing them, which provides opportunities to computationally
reduce the sequencing time and cost. However, existing works utilizing Read
Until either 1) require powerful computational resources that may not be
available for portable sequencers or 2) lack scalability for large genomes,
rendering them inaccurate or ineffective.
We propose RawHash, the first mechanism that can accurately and efficiently
perform real-time analysis of nanopore raw signals for large genomes using a
hash-based similarity search. To enable this, RawHash ensures the signals
corresponding to the same DNA content lead to the same hash value, regardless
of the slight variations in these signals. RawHash achieves an accurate
hash-based similarity search via an effective quantization of the raw signals
such that signals corresponding to the same DNA content have the same quantized
value and, subsequently, the same hash value.
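To make the quantization idea concrete, below is a minimal Python sketch of hashing quantized signal values; the bucket width, event values, and choice of hash function are illustrative assumptions, not RawHash's actual parameters.

```python
import hashlib

def quantize(value, bucket_width=0.25):
    # Map a normalized signal value to a coarse bucket so that slightly
    # different measurements of the same DNA content share a bucket.
    return int(value // bucket_width)

def hash_events(events, bucket_width=0.25):
    # Hash a window of consecutive signal events via their quantized values,
    # making the hash robust to slight signal variations.
    buckets = tuple(quantize(e, bucket_width) for e in events)
    return hashlib.sha1(repr(buckets).encode()).hexdigest()

# Two noisy observations of the same DNA content yield the same hash value:
assert hash_events([0.51, 1.26, 0.02]) == hash_events([0.55, 1.30, 0.05])
```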
We evaluate RawHash on three applications: 1) read mapping, 2) relative
abundance estimation, and 3) contamination analysis. Our evaluations show that
RawHash is the only tool that can provide high accuracy and high throughput for
analyzing large genomes in real time. When compared to the state-of-the-art
techniques, UNCALLED and Sigmap, RawHash provides 1) 25.8x and 3.4x better
average throughput and 2) an average speedup of 32.1x and 2.1x in the mapping
time, respectively.
Source code is available at https://github.com/CMU-SAFARI/RawHash.
TargetCall: Eliminating the Wasted Computation in Basecalling via Pre-Basecalling Filtering
Basecalling is an essential step in nanopore sequencing analysis where the
raw signals of nanopore sequencers are converted into nucleotide sequences,
i.e., reads. State-of-the-art basecallers employ complex deep learning models
to achieve high basecalling accuracy. This makes basecalling
computationally inefficient and memory-hungry, bottlenecking the entire genome
analysis pipeline. However, for many applications, the majority of reads do not
match the reference genome of interest (i.e., target reference) and thus are
discarded in later steps in the genomics pipeline, wasting the basecalling
computation. To overcome this issue, we propose TargetCall, the first fast and
widely-applicable pre-basecalling filter to eliminate the wasted computation in
basecalling. TargetCall's key idea is to discard reads that will not match the
target reference (i.e., off-target reads) prior to basecalling. TargetCall
consists of two main components: (1) LightCall, a lightweight neural network
basecaller that produces noisy reads; and (2) Similarity Check, which labels
each of these noisy reads as on-target or off-target by matching them to the
target reference. TargetCall filters out all off-target reads before
basecalling, and the highly accurate but slow basecalling is performed only on
the raw signals whose noisy reads are labeled as on-target. Our thorough
experimental evaluations using both real and simulated data show that
TargetCall 1) improves the end-to-end basecalling performance of the
state-of-the-art basecaller by 3.31x while maintaining high (98.88%)
sensitivity in keeping on-target reads, 2) maintains high accuracy in
downstream analysis, 3) precisely filters out up to 94.71% of off-target reads,
and 4) achieves better performance, sensitivity, and generality compared to
prior works. We freely open-source TargetCall at
https://github.com/CMU-SAFARI/TargetCall.
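As a rough illustration of this two-stage flow, the following Python sketch wires a fast noisy basecaller and a similarity check into a pre-basecalling filter; lightcall and maps_to are hypothetical stand-ins for LightCall and Similarity Check, which the paper implements with a lightweight neural network and sequence matching.

```python
def pre_basecall_filter(raw_signals, target_reference, lightcall, maps_to):
    # lightcall: fast, lower-accuracy basecaller (stand-in for LightCall).
    # maps_to: checks whether a noisy read matches the target reference
    # (stand-in for Similarity Check).
    on_target = []
    for signal in raw_signals:
        noisy_read = lightcall(signal)
        if maps_to(noisy_read, target_reference):
            on_target.append(signal)  # keep the raw signal, not the noisy read
    return on_target  # only these reach the slow, highly accurate basecaller
```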
Accelerating Time Series Analysis via Processing using Non-Volatile Memories
Time Series Analysis (TSA) is a critical workload for consumer-facing
devices. Accelerating TSA is vital for many domains as it enables the
extraction of valuable information and the prediction of future events. The
state-of-the-art algorithm in TSA is the subsequence Dynamic Time Warping
(sDTW) algorithm. However, sDTW's computational complexity increases
quadratically with the time series' length, resulting in two performance
implications. First, the amount of data parallelism available significantly
exceeds what the small number of processing units in commodity systems (e.g.,
CPUs) can exploit. Second, sDTW is bottlenecked by memory because it 1) has low
arithmetic intensity and 2) incurs a large memory footprint. To tackle these
two challenges, we leverage Processing-using-Memory (PuM) by performing in-situ
computation where data resides, using the memory cells. PuM provides a
promising solution to alleviate data movement bottlenecks and exposes immense
parallelism.
In this work, we present MATSA, the first MRAM-based Accelerator for Time
Series Analysis. The key idea is to exploit magneto-resistive memory crossbars
to enable energy-efficient and fast time series computation in memory. MATSA
provides the following key benefits: 1) it leverages high levels of parallelism
in the memory substrate by exploiting column-wise arithmetic operations, and 2)
it significantly reduces data movement costs by performing computation using
the memory cells. We evaluate three versions of MATSA to match the requirements
of different environments (e.g., embedded, desktop, or HPC computing) based on
MRAM technology trends. We perform a design space exploration and demonstrate
that our HPC version of MATSA can improve performance by 7.35x/6.15x/6.31x and
energy efficiency by 11.29x/4.21x/2.65x over server CPU, GPU, and PNM
architectures, respectively.
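For reference, the classic subsequence DTW recurrence that MATSA accelerates looks as follows in plain Python; this toy version only illustrates the O(n*m) cell updates and their dependency pattern, not MATSA's in-memory implementation.

```python
def sdtw(query, series):
    # Textbook subsequence DTW: the match may start at any position in the
    # series (row 0 is all zeros) and end anywhere (minimum over the last row).
    n, m = len(query), len(series)
    INF = float("inf")
    prev = [0.0] * (m + 1)
    for i in range(1, n + 1):
        curr = [INF] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(query[i - 1] - series[j - 1])
            curr[j] = cost + min(curr[j - 1], prev[j], prev[j - 1])
        prev = curr
    return min(prev[1:])

print(sdtw([1.0, 2.0, 3.0], [9.0, 1.0, 2.0, 3.0, 9.0]))  # 0.0: exact subsequence
```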
TuRaN: True Random Number Generation Using Supply Voltage Underscaling in SRAMs
Prior works propose SRAM-based true random number generators (TRNGs) that
extract entropy from SRAM arrays. SRAM arrays are widely used to store data in
a majority of specialized and general-purpose chips. Thus,
SRAM-based TRNGs present a low-cost alternative to dedicated hardware TRNGs.
However, existing SRAM-based TRNGs suffer from 1) low TRNG throughput, 2) high
energy consumption, 3) high TRNG latency, and 4) the inability to generate true
random numbers continuously, which limits the application space of SRAM-based
TRNGs. Our goal in this paper is to design an SRAM-based TRNG that overcomes
these four key limitations and thus, extends the application space of
SRAM-based TRNGs. To this end, we propose TuRaN, a new high-throughput,
energy-efficient, and low-latency SRAM-based TRNG that can sustain continuous
operation. TuRaN leverages the key observation that accessing SRAM cells
results in random access failures when the supply voltage is reduced below the
manufacturer-recommended supply voltage. TuRaN generates random numbers at high
throughput by repeatedly accessing SRAM cells with reduced supply voltage and
post-processing the resulting random faults using the SHA-256 hash function. To
demonstrate the feasibility of TuRaN, we conduct SPICE simulations on different
process nodes and analyze the potential of access failures for use as an entropy
source. We verify and support our simulation results by conducting real-world
experiments on two commercial off-the-shelf FPGA boards. We evaluate the
quality of the random numbers generated by TuRaN using the widely-adopted NIST
standard randomness tests and observe that TuRaN passes all tests. TuRaN
generates true random numbers with (i) an average (maximum) throughput of
1.6Gbps (1.812Gbps), (ii) 0.11nJ/bit energy consumption, and (iii) 278.46us
latency.
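The sampling-plus-conditioning loop can be sketched in a few lines of Python; read_sram_block is a hypothetical function standing in for an SRAM access at reduced supply voltage, and the block count is an illustrative assumption.

```python
import hashlib

def turan_random_bytes(read_sram_block, n_blocks=4):
    # read_sram_block() returns the raw bits (as bytes) of an SRAM region
    # accessed below the manufacturer-recommended supply voltage, whose
    # random access failures serve as the entropy source.
    out = b""
    for _ in range(n_blocks):
        raw = read_sram_block()
        out += hashlib.sha256(raw).digest()  # SHA-256 conditions the raw entropy
    return out  # 32 bytes of true random data per block
```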
BLEND: A Fast, Memory-Efficient, and Accurate Mechanism to Find Fuzzy Seed Matches
Generating the hash values of short subsequences, called seeds, enables
quickly identifying similarities between genomic sequences by matching seeds
with a single lookup of their hash values. However, these hash values can be
used only for finding exact-matching seeds as the conventional hashing methods
assign distinct hash values for different seeds, including highly similar
seeds. Finding only exact-matching seeds leads to either 1) increased use of
costly sequence alignment or 2) limited sensitivity.
We introduce BLEND, the first efficient and accurate mechanism that can
identify both exact-matching and highly similar seeds, called fuzzy seed
matches, with a single lookup of their hash values. BLEND 1) utilizes a
technique called SimHash, which can generate the same hash value for similar
sets, and 2)
provides the proper mechanisms for using seeds as sets with the SimHash
technique to find fuzzy seed matches efficiently.
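The following Python sketch shows SimHash over a seed treated as a set of overlapping k-mers, in the spirit of this idea; the 32-bit hash width, k value, and MD5-based item hashing are illustrative assumptions rather than BLEND's actual design.

```python
import hashlib

def simhash(items, bits=32):
    # SimHash: similar sets of items tend to produce identical hash values.
    counts = [0] * bits
    for item in items:
        h = int.from_bytes(hashlib.md5(item.encode()).digest()[:4], "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)

def seed_hash(seed, k=3):
    # Treat a seed as the set of its overlapping k-mers, then SimHash the set.
    return simhash({seed[i:i + k] for i in range(len(seed) - k + 1)})

# Highly similar seeds (here, one substitution apart) often collide, enabling
# fuzzy seed matching with a single hash lookup:
print(seed_hash("ACGTACGTACGT"), seed_hash("ACGTACGAACGT"))
```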
We show the benefits of BLEND when used in read overlapping and read mapping.
For read overlapping, compared to the state-of-the-art tool minimap2, BLEND is
faster by 2.6x-63.5x (on average 19.5x), has a lower memory footprint by
0.9x-9.7x (on average 3.6x), and finds higher-quality overlaps that lead to
more accurate de novo assemblies. For read mapping, BLEND is faster by
0.7x-3.7x (on average 1.7x) than
minimap2. Source code is available at https://github.com/CMU-SAFARI/BLEND.
SMASH: Co-designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations
Important workloads, such as machine learning and graph analytics
applications, heavily involve sparse linear algebra operations. These
operations use sparse matrix compression as an effective means to avoid storing
zeros and performing unnecessary computation on zero elements. However,
compression techniques like Compressed Sparse Row (CSR) that are widely used
today introduce significant instruction overhead and expensive pointer-chasing
operations to discover the positions of the non-zero elements. In this paper,
we identify the discovery of the positions (i.e., indexing) of non-zero
elements as a key bottleneck in sparse matrix-based workloads, which greatly
reduces the benefits of compression. We propose SMASH, a hardware-software
cooperative mechanism that enables highly-efficient indexing and storage of
sparse matrices. The key idea of SMASH is to explicitly enable the hardware to
recognize and exploit sparsity in data. To this end, we devise a novel software
encoding based on a hierarchy of bitmaps. This encoding can be used to
efficiently compress any sparse matrix, regardless of the extent and structure
of sparsity. At the same time, the bitmap encoding can be directly interpreted
by the hardware. We design a lightweight hardware unit, the Bitmap Management
Unit (BMU), that buffers and scans the bitmap hierarchy to perform
highly-efficient indexing of sparse matrices. SMASH exposes an expressive and
rich ISA to communicate with the BMU, which enables its use in accelerating any
sparse matrix computation. We demonstrate the benefits of SMASH on four use
cases that include sparse matrix kernels and graph analytics applications.
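As a toy illustration of a hierarchical bitmap encoding, the Python sketch below uses a two-level scheme in which a top-level bitmap marks non-empty blocks and per-block bitmaps mark non-zero positions; the block size and the number of levels are illustrative choices, not SMASH's actual parameters.

```python
def encode(values, block=4):
    # Two-level bitmap compression of a flattened sparse matrix: zero blocks
    # cost a single top-level bit, so indexing can skip them without any
    # pointer chasing.
    top_bitmap, block_bitmaps, nonzeros = [], [], []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        bits = [1 if v != 0 else 0 for v in chunk]
        if any(bits):
            top_bitmap.append(1)
            block_bitmaps.append(bits)
            nonzeros.extend(v for v in chunk if v != 0)
        else:
            top_bitmap.append(0)  # whole block skipped during indexing
    return top_bitmap, block_bitmaps, nonzeros

print(encode([0, 0, 0, 0, 5, 0, 7, 0]))  # ([0, 1], [[1, 0, 1, 0]], [5, 7])
```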
Scrooge: a fast and memory-frugal genomic sequence aligner for CPUs, GPUs, and ASICs
Motivation: Pairwise sequence alignment is a very time-consuming step in common bioinformatics pipelines. Speeding up this step requires heuristics, efficient implementations, and/or hardware acceleration. A promising candidate for all of the above is the recently proposed GenASM algorithm. We identify and address three inefficiencies in the GenASM algorithm: it has a high amount of data movement, a large memory footprint, and does some unnecessary work.
Results: We propose Scrooge, a fast and memory-frugal genomic sequence aligner. Scrooge includes three novel algorithmic improvements which reduce the data movement, memory footprint, and the number of operations in the GenASM algorithm. We provide efficient open-source implementations of the Scrooge algorithm for CPUs and GPUs, which demonstrate the significant benefits of our algorithmic improvements. For long reads, the CPU version of Scrooge achieves a 20.1x, 1.7x, and 2.1x speedup over KSW2, Edlib, and a CPU implementation of GenASM, respectively. The GPU version of Scrooge achieves a 4.0x, 80.4x, 6.8x, 12.6x, and 5.9x speedup over the CPU version of Scrooge, KSW2, Edlib, Darwin-GPU, and a GPU implementation of GenASM, respectively. We estimate an ASIC implementation of Scrooge to use 3.6x less chip area and 2.1x less power than a GenASM ASIC while maintaining the same throughput. Further, we systematically analyze the throughput and accuracy behavior of GenASM and Scrooge under various configurations. As the best configuration of Scrooge depends on the computing platform, we make several observations that can help guide future implementations of Scrooge.
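As background for what these aligners compute, a minimal edit-distance dynamic program in Python is shown below; GenASM and Scrooge use bitvector algorithms that are far more efficient than this textbook formulation, which is included only to illustrate the quadratic cost being attacked.

```python
def edit_distance(a, b):
    # Textbook O(len(a)*len(b)) dynamic program; real aligners like GenASM
    # and Scrooge use much faster bitvector formulations of this recurrence.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, 1):
            curr[j] = min(prev[j] + 1,               # deletion
                          curr[j - 1] + 1,           # insertion
                          prev[j - 1] + (ca != cb))  # match/substitution
        prev = curr
    return prev[-1]

print(edit_distance("ACGT", "AGGT"))  # 1
```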
ALP: Alleviating CPU-Memory Data Movement Overheads in Memory-Centric Systems
Partitioning applications between NDP and host CPU cores incurs inter-segment data movement overhead: data generated in one segment (e.g., a group of instructions or a function) must be moved to the core that executes the consecutive segments. Prior works take two approaches to this problem. The first class of works maps segments to NDP or host cores based on the properties of each segment, neglecting the inter-segment data movement overhead. The second class of works partitions applications based on the overall memory bandwidth saving of each segment, and does not offload each segment to its best-fitting core if doing so incurs high inter-segment data movement. We show that 1) ideally mapping each segment to its best-fitting core can provide substantial benefits, and 2) the inter-segment data movement reduces this benefit significantly. To this end, we introduce ALP, a new programmer-transparent technique that leverages the performance benefits of NDP by alleviating the inter-segment data movement overhead between host and memory and enabling efficient partitioning of applications. ALP alleviates this overhead by proactively and accurately transferring the required data between segments. This is based on the key observation that the instructions that generate the inter-segment data stay the same across different executions of a program on different inputs. ALP uses a compiler pass to identify these instructions and uses specialized hardware to transfer data between the host and NDP cores at runtime. ALP efficiently maps application segments to either host or NDP cores considering 1) the properties of each segment, 2) the inter-segment data movement overhead, and 3) whether this overhead can be alleviated in a timely manner. We evaluate ALP across a wide range of workloads and show on average 54.3% and 45.4% speedup compared to host-CPU-only and NDP-only executions, respectively.
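To illustrate the segment-mapping trade-off, here is a toy dynamic program in Python that places each segment on the host or NDP core while charging a transfer cost whenever consecutive segments run on different cores; all cost numbers are made up, and this sketches only the mapping consideration ALP automates, not ALP's proactive-transfer mechanism.

```python
def map_segments(host_cost, ndp_cost, transfer_cost):
    # best[(side, i)]: cheapest total cost of executing segments 0..i with
    # segment i placed on `side`, charging transfer_cost on every side switch
    # (the inter-segment data movement overhead).
    n = len(host_cost)
    best = {("host", 0): host_cost[0], ("ndp", 0): ndp_cost[0]}
    for i in range(1, n):
        for side, cost in (("host", host_cost[i]), ("ndp", ndp_cost[i])):
            other = "ndp" if side == "host" else "host"
            stay = best[(side, i - 1)]
            switch = best[(other, i - 1)] + transfer_cost
            best[(side, i)] = cost + min(stay, switch)
    return min(best[("host", n - 1)], best[("ndp", n - 1)])

print(map_segments([3, 9, 2], [8, 1, 7], transfer_cost=2))  # 10: host->ndp->host
```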